A SIMPLE ALGORITHM FOR MERGING TWO DISJOINT
LINEARLY ORDERED SETS* 


F. K. HWANG AND S. LIN†


Abstract. In this paper we present a new algorithm for merging two linearly ordered sets which 
requires substantially fewer comparisons than the commonly used tape merge or binary insertion 
algorithms. Bounds on the difference between the number of comparisons required by this algorithm 
and the information theory lower bounds are derived. Results from a computer implementation of 
the new algorithm are given and compared with a similar implementation of the tape merge algorithm. 
1. Introduction. Suppose we are given two disjoint linearly ordered subsets
$A$ and $B$ of a linearly ordered set $S$, say

\[ A = \{a_1 < a_2 < \cdots < a_m\}, \qquad B = \{b_1 < b_2 < \cdots < b_n\}. \]

The problem is to determine the linear ordering of their union (i.e., to merge $A$
and $B$) by means of a sequence of pairwise comparisons between an element of
$A$ and an element of $B$. Given any algorithm $s$ to solve this problem, we are
interested in the maximum number of comparisons, $K_s(m,n)$, required under all
possible orderings of $A \cup B$. An algorithm $s$ is said to be M-optimal if
$K_s(m,n) = K(m,n)$, where $K(m,n) = \min_s K_s(m,n)$. In this paper, we give a simple
algorithm for solving this problem, called the generalized binary algorithm $g$,
and derive some bounds for $K_g(m,n) - K(m,n)$ which are substantially better
than those for two other known algorithms.


2. Some preliminary discussions and results. Let the cardinalities of $A$ and $B$
be $m$ and $n$ respectively. We assume $m \le n$. Let $\mathcal{D}_0$ be the set of all possible
orderings of $A \cup B$ and $\mathcal{D}_k$ be the subset of $\mathcal{D}_0$ consistent with the results of the first
$k$ comparisons we have made thus far. It is clear that, after making the $i$th
comparison, $i = 1, 2, \ldots, k$, one of the two possible outcomes must have
$|\mathcal{D}_i| \ge \frac{1}{2}|\mathcal{D}_{i-1}|$ and that merging is achieved if and only if $\mathcal{D}_k$ contains exactly
one element. Since $\mathcal{D}_0$ has $\binom{m+n}{m}$ elements, or as we say, data points, we must
have, for any algorithm $s$,

\[ K_s(m,n) \ge \left\lceil \log_2 \binom{m+n}{m} \right\rceil = I(m,n). \]

$I(m,n)$ is usually called the information theory bound.
For $m = 1$, the binary insertion algorithm is optimal and hence

(1) $K(1,n) = I(1,n) = \lceil \log_2 (n + 1) \rceil$.¹
* Received by the editors August 30, 1971. 


† Bell Laboratories, Murray Hill, New Jersey 07974.
¹ As usual, we let $\lceil x \rceil$ denote the smallest integer $\ge x$ and $\lfloor x \rfloor$ the largest integer $\le x$.
In a recent paper [1], the authors constructed an M-optimal algorithm for
$m = 2$ and thereby determined the values of $K(2,n)$. It can also be shown [2] that


(2) K(m,n)=m+n—1 for3<msSnsm+4+3 
and 
(3) $K(m, 2m) \le 3m - 2$ for $m \ge 3$.


The determination of $K(m,n)$ for $m \ge 3$ appears to be a very difficult problem.
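
The information theory bound can be evaluated exactly with integer arithmetic. The following is a minimal sketch in Python; the function name `info_bound` is our own label, and the code relies on the fact that for an integer $N \ge 1$ the value $\lceil \log_2 N \rceil$ equals the bit length of $N - 1$.

```python
from math import comb


def info_bound(m: int, n: int) -> int:
    """I(m, n) = ceil(log2(C(m + n, m))), computed without floating point."""
    data_points = comb(m + n, m)            # number of possible orderings of A and B combined
    return (data_points - 1).bit_length()   # ceil(log2(N)) for an integer N >= 1


# Equation (1): for m = 1 the bound reduces to ceil(log2(n + 1)),
# which is the bit length of n.
for n in range(1, 20):
    assert info_bound(1, n) == n.bit_length()
```

Working with `bit_length` avoids rounding problems that a floating-point logarithm could introduce when the binomial coefficient is close to a power of two.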


3. Two existing algorithms. For the purpose of comparing with the generalized 
binary algorithm to be presented in the next section, we mention two existing 
algorithms. 

I. The “tape merge” algorithm $t$. The “tape merge” algorithm is the commonly
used procedure to merge two tapes or lists of sorted items. It can be described
by the following steps (details of storing and stop conditions are omitted):

TM1. Compare $a_m$ with $b_n$.

TM2. If $a_m < b_n$, set $n = n - 1$ and go to TM1.

TM3. If $a_m > b_n$, set $m = m - 1$ and go to TM1.

It can be easily shown that

\[ K_t(m,n) = m + n - 1 \]

and hence the “tape merge” algorithm is M-optimal for $n < m + 3$ [2].
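
Steps TM1 to TM3 can be sketched as follows. This is an illustrative Python version working from the largest elements downward, as in the description above; the function name `tape_merge` and the returned comparison counter are our additions, and the stop condition (one list exhausted) is our reading of the omitted details.

```python
def tape_merge(a, b):
    """Merge two sorted, disjoint lists; return (merged list, comparisons made)."""
    a, b = list(a), list(b)
    out = []                      # collected from largest to smallest
    comparisons = 0
    while a and b:
        comparisons += 1          # TM1: compare a_m with b_n
        if a[-1] < b[-1]:
            out.append(b.pop())   # TM2: b_n is the current maximum
        else:
            out.append(a.pop())   # TM3: a_m is the current maximum
    # Whatever remains is already sorted and smaller than everything in out.
    return (a or b) + out[::-1], comparisons


merged, used = tape_merge([1, 3, 5], [2, 4, 6])
print(merged, used)
```

A perfectly interleaved input such as the one above keeps both lists alive until the last step, which is the situation in which all $m + n - 1$ comparisons are needed.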

II. The “simple binary” algorithm $s$. The “simple binary” algorithm can be
described by the following steps:

SB1. Merge $a_m$ into $B$ by the binary search procedure.

SB2. Pull out $a_m$ and the elements of $B$ larger than $a_m$. (These are already in order and
larger than the rest of the elements of $A \cup B$.) Set $m = m - 1$ and redefine $m$ and $n$.
(The new $n \ge$ new $m$.) Go back to SB1.

It is clear that under the worst possible outcome, $a_m$ is always larger than
$b_n$ and hence no element of $B$ is discarded. Therefore,

\[ K_s(m,n) = m \lceil \log_2 (n + 1) \rceil. \]

For $m = 1$, we have
\[ K_s(m,n) = K(m,n). \]
However, we shall show in the next section that
\[ K_s(m,n) > K(m,n) \quad \text{for } m > 2. \]
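
A minimal Python sketch of steps SB1 and SB2 is given below. The name `simple_binary_merge` and the comparison counter are ours. As a simplifying assumption, this sketch always inserts elements of the first list into the second and does not interchange the roles of the two lists.

```python
def simple_binary_merge(a, b):
    """Insert the elements of a into b one at a time, largest first."""
    a, b = list(a), list(b)
    tail = []                       # output collected from largest to smallest
    comparisons = 0
    while a and b:
        am = a.pop()
        lo, hi = 0, len(b)
        while lo < hi:              # SB1: binary search for the position of a_m in B
            mid = (lo + hi) // 2
            comparisons += 1
            if am < b[mid]:
                hi = mid
            else:
                lo = mid + 1
        tail.extend(reversed(b[lo:]))   # SB2: elements of B larger than a_m
        del b[lo:]
        tail.append(am)
    return (a or b) + tail[::-1], comparisons


# Every element of a exceeds every element of b, so nothing in b is discarded.
merged, used = simple_binary_merge([10, 11], [1, 2, 3])
print(merged, used)
```

With $m = 2$ and $n = 3$ this input costs $m \lceil \log_2 (n+1) \rceil = 4$ comparisons, matching the worst case described above.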


The distinctive feature of these two algorithms is their simplicity, although
in general they are quite inefficient in the sense that both $K_t(m,n) - K(m,n)$ and
$K_s(m,n) - K(m,n)$ can be very large. In the next section, we shall present an
algorithm which matches the two abovementioned algorithms in simplicity and
yet improves a great deal on their efficiency.


4. The generalized binary algorithm $g$. For the sake of simplicity, we shall
assume that whenever we are required to merge two disjoint linearly ordered
sets with cardinalities $x$ and $y$ respectively, $n$ will always refer to $\max(x, y)$ and
$m$ to $\min(x, y)$, so that $n \ge m$.

The generalized binary algorithm may now be described as follows (again,
details of storage and stop criteria are omitted):

GB1. Let $\alpha = \lfloor \log_2 (n/m) \rfloor$ and $x = n - 2^\alpha + 1$.

GB2. Compare $a_m$ with $b_x$. If $a_m < b_x$, pull out the set of all elements in
$B$ that are $\ge b_x$, say $C$. We are then left with the problem of merging two disjoint sets $A$
and $B - C$. Redefine $m$ and $n$ and go back to GB1. (Note that $B - C$ has $n - 2^\alpha$
elements and we need to interchange the roles of $m$ and $n$ if and only if $n = m$.)

GB3. If $a_m > b_x$, merge $a_m$ into the set $C - b_x$ by the simple binary algorithm.
Note that $C - b_x$ has exactly $2^\alpha - 1$ elements and $a_m$ can be merged into the set
in exactly $\alpha$ more comparisons. Pull out $a_m$ and the set $D$ of all elements in $B$ larger than $a_m$.
We are then left with the problem of merging the set $A - a_m$ with the set $B - D$.
Redefine $m$ and $n$ and go back to GB1.
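
Steps GB1 to GB3 can be sketched in Python as follows. This is an illustrative version under our own assumptions: the name `generalized_binary_merge` is ours, lists are 0-based (so $b_x$ is `b[n - 2**alpha]`), the roles of the two lists are interchanged whenever the first becomes longer, and the loop stops when either list is empty.

```python
def generalized_binary_merge(a, b):
    """Merge two sorted, disjoint lists; return (merged list, comparisons made)."""
    a, b = list(a), list(b)
    tail = []                                 # output collected from largest to smallest
    comparisons = 0
    while a and b:
        if len(a) > len(b):
            a, b = b, a                       # keep m = len(a) <= n = len(b)
        m, n = len(a), len(b)
        alpha = (n // m).bit_length() - 1     # GB1: floor(log2(n / m))
        x = n - 2 ** alpha                    # 0-based index of b_x
        am = a[-1]
        comparisons += 1                      # GB2: compare a_m with b_x
        if am < b[x]:
            tail.extend(reversed(b[x:]))      # C: the 2^alpha largest elements of B
            del b[x:]
        else:
            lo, hi = x + 1, n                 # GB3: binary insertion into C - b_x
            while lo < hi:                    # 2^alpha - 1 candidates, alpha comparisons
                mid = (lo + hi) // 2
                comparisons += 1
                if am < b[mid]:
                    hi = mid
                else:
                    lo = mid + 1
            tail.extend(reversed(b[lo:]))     # D: elements of B larger than a_m
            del b[lo:]
            tail.append(a.pop())
    return (a or b) + tail[::-1], comparisons


a = [5, 50]
b = list(range(0, 100, 3))                    # 34 multiples of 3, disjoint from a
merged, used = generalized_binary_merge(a, b)
assert merged == sorted(a + b)
print(used)
```

Because the candidate range in GB3 always has $2^\alpha - 1$ elements, each halving step leaves two equal parts, which is why the insertion costs exactly $\alpha$ comparisons regardless of where $a_m$ lands.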

For this algorithm $g$, $K_g(m,n)$ is given by the following theorem.

THEOREM 1. Let $\alpha = \lfloor \log_2 (n/m) \rfloor$. Write $n = 2^\alpha m + 2^\alpha p + \theta$, where $p$ and $\theta$
are uniquely determined nonnegative integers satisfying $0 \le p < m$, $0 \le \theta < 2^\alpha$.
Then $K_g(m,n) = (2 + \alpha)m + p - 1$.

Proof. If $\alpha = 0$, then $n = m + p$, and it is clear that the worst possible data forces
the algorithm $g$ to be identical with the algorithm $t$ discussed in the previous
section.

Hence $K_g(m,n) = K_t(m,n) = m + n - 1 = 2m + p - 1$.

If $m = 1$, $p$ must be zero and $n = 2^\alpha + \theta$. It is clear that $a_m > b_x$ is the worst
outcome and hence $K_g(1,n) = K(1,n) = 1 + \alpha$.

We now prove Theorem 1 by induction on $m + n$. Assume the theorem true
for all $m'$, $n'$ such that $m' + n' < m + n$, and for all $m, n$ with $\alpha = 0$ or $m = 1$. We
prove the theorem true for $m, n$ with $\alpha > 0$ and $m > 1$. The theorem is trivially true
for $m + n = 2$.

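After making the first comparison of $a_m$ with $b_x$, we have two possibilities:

(i) $a_m < b_x$, and we are left with the problem of merging two sets with $m$
and $n - 2^\alpha$ elements.

(ii) $a_m > b_x$. After merging $a_m$ into the set $C - b_x$ in $\alpha$ more comparisons,
we are left, in the worst case, with the problem of merging two sets with $m - 1$
and $n$ elements. Hence

\[ K_g(m,n) = \max[1 + K_g(m, n - 2^\alpha),\; 1 + \alpha + K_g(m - 1, n)]. \]

Now,
\[ n - 2^\alpha = \begin{cases}
2^\alpha m + 2^\alpha (p - 1) + \theta & \text{if } p \ne 0, \\
2^{\alpha-1} m + 2^{\alpha-1}(m - 1) + \theta - 2^{\alpha-1} & \text{if } p = 0 \text{ and } \theta \ge 2^{\alpha-1}, \\
2^{\alpha-1} m + 2^{\alpha-1}(m - 2) + \theta & \text{if } p = 0 \text{ and } \theta < 2^{\alpha-1}.
\end{cases} \]

Hence by induction,
\[ K_g(m, n - 2^\alpha) = \begin{cases}
(2 + \alpha)m + (p - 1) - 1 & \text{if } p \ne 0, \\
(1 + \alpha)m + (m - 1) - 1 & \text{if } p = 0 \text{ and } \theta \ge 2^{\alpha-1}, \\
(1 + \alpha)m + (m - 2) - 1 & \text{if } p = 0 \text{ and } \theta < 2^{\alpha-1}.
\end{cases} \]
Similarly,
\[ n = \begin{cases}
2^\alpha (m - 1) + 2^\alpha (p + 1) + \theta & \text{if } p < m - 2, \\
2^{1+\alpha}(m - 1) + \theta & \text{if } p = m - 2, \\
2^{1+\alpha}(m - 1) + 2^\alpha + \theta & \text{if } p = m - 1.
\end{cases} \]
Hence by induction,
\[ K_g(m - 1, n) = \begin{cases}
(2 + \alpha)(m - 1) + (p + 1) - 1 & \text{if } p < m - 2, \\
(3 + \alpha)(m - 1) - 1 & \text{otherwise.}
\end{cases} \]
Therefore,
\[ 1 + K_g(m, n - 2^\alpha) = \begin{cases}
(2 + \alpha)m + p - 2 & \text{if } p = 0 \text{ and } \theta < 2^{\alpha-1}, \\
(2 + \alpha)m + p - 1 & \text{otherwise,}
\end{cases} \]
and
\[ 1 + \alpha + K_g(m - 1, n) = \begin{cases}
(2 + \alpha)m + p - 2 & \text{if } p = m - 1, \\
(2 + \alpha)m + p - 1 & \text{otherwise.}
\end{cases} \]

Since the conditions $p = 0$ and $p = m - 1$ are mutually exclusive for $m > 1$,
we have

\[ K_g(m,n) = \max[1 + K_g(m, n - 2^\alpha),\; 1 + \alpha + K_g(m - 1, n)] = (2 + \alpha)m + p - 1, \]

and hence the theorem is proved.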
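
The theorem can be checked numerically for small cases. The sketch below is our own construction, not part of the original proof: `k_g_worst` explores both outcomes of the first comparison and every possible size $d$ of the discarded set $D$ (from 0 to $2^\alpha - 1$), and the result is compared with the closed form in `k_g_formula`. Both function names are hypothetical.

```python
from functools import lru_cache


def k_g_formula(m: int, n: int) -> int:
    """Theorem 1: (2 + alpha) * m + p - 1, for 1 <= m <= n."""
    alpha = (n // m).bit_length() - 1        # floor(log2(n / m))
    p = (n - (m << alpha)) >> alpha          # n = 2^alpha * m + 2^alpha * p + theta
    return (2 + alpha) * m + p - 1


@lru_cache(maxsize=None)
def k_g_worst(m: int, n: int) -> int:
    """Worst-case number of comparisons of steps GB1-GB3, by exhaustive recursion."""
    m, n = min(m, n), max(m, n)              # n always refers to the larger set
    if m == 0:
        return 0                             # nothing left to merge
    alpha = (n // m).bit_length() - 1
    block = 1 << alpha
    # Outcome a_m < b_x: the 2^alpha largest elements of B are removed.
    below = 1 + k_g_worst(m, n - block)
    # Outcome a_m > b_x: alpha more comparisons, then d elements of B are removed.
    above = 1 + alpha + max(k_g_worst(m - 1, n - d) for d in range(block))
    return max(below, above)


for m in range(1, 13):
    for n in range(m, 61):
        assert k_g_worst(m, n) == k_g_formula(m, n)
```

The maximum over $d$ is attained at $d = 0$, which is the "worst case" of $m - 1$ and $n$ remaining elements used in the proof.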
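Comparing the generalized binary algorithm $g$ with the tape merge algorithm $t$
and the simple binary algorithm $s$, we have
\begin{align*}
K_t(m,n) - K_g(m,n) &= (m + n - 1) - [(\alpha + 2)m + p - 1] \\
&= m + 2^\alpha m + 2^\alpha p + \theta - 1 - (\alpha + 2)m - p + 1 \\
&= (2^\alpha - \alpha - 1)m + (2^\alpha - 1)p + \theta.
\end{align*}
Hence $K_t(m,n) = K_g(m,n)$ only if $\alpha = 0$, or $\alpha = 1$ and $p = \theta = 0$. Otherwise,
$K_t(m,n) - K_g(m,n) = n - (\alpha + 1)m - p > 0$. Similarly,
\begin{align*}
K_s(m,n) - K_g(m,n) &= m \lceil \log_2 (n + 1) \rceil - [(\alpha + 2)m + p - 1] \\
&\ge m(\alpha + \lfloor \log_2 (m + p) \rfloor + 1) - [(\alpha + 2)m + p - 1] \\
&= m(\lfloor \log_2 (m + p) \rfloor - 1) - p + 1.
\end{align*}
Hence $K_s(m,n) = K_g(m,n)$ only if $m = 1$, or $m = 2$, $p = 1$. Otherwise
$K_s(m,n) - K_g(m,n) \ge m(\lfloor \log_2 (m + p) \rfloor - 1) - p + 1 > 0$.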
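
The two comparisons above are easy to tabulate. The following Python sketch evaluates the three worst-case formulas and checks the stated identity and inequality over a range of sizes; the helper names (`decompose`, `k_t`, `k_s`, `k_g`) are ours.

```python
def decompose(m: int, n: int):
    """Return (alpha, p, theta) with n = 2^alpha * m + 2^alpha * p + theta."""
    alpha = (n // m).bit_length() - 1
    r = n - (m << alpha)
    return alpha, r >> alpha, r & ((1 << alpha) - 1)


def k_t(m: int, n: int) -> int:
    return m + n - 1                         # tape merge


def k_s(m: int, n: int) -> int:
    return m * n.bit_length()                # m * ceil(log2(n + 1)), simple binary


def k_g(m: int, n: int) -> int:
    alpha, p, _ = decompose(m, n)
    return (2 + alpha) * m + p - 1           # generalized binary (Theorem 1)


for m in range(1, 30):
    for n in range(m, 500):
        alpha, p, theta = decompose(m, n)
        # Closed form of K_t - K_g.
        assert k_t(m, n) - k_g(m, n) == (2**alpha - alpha - 1) * m + (2**alpha - 1) * p + theta
        # Lower bound on K_s - K_g.
        floor_log = (m + p).bit_length() - 1
        assert k_s(m, n) - k_g(m, n) >= m * (floor_log - 1) - p + 1

print(k_t(10, 1000), k_s(10, 1000), k_g(10, 1000))
```

The sizes in the final line are arbitrary illustrative values chosen by us to show a case with large $n/m$.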
It is often convenient to refer to the numbers $n_s(m,k)$, defined for an algorithm $s$ as the largest $n$
such that $K_s(m,n) \le k$. Table 1 gives some of these numbers for the algorithms
$t$, $s$ and $g$. Also we have, for $k = (2 + \alpha)m + p - 1$,
\[ n_g(m,k) = 2^\alpha (m + p + 1) - 1, \]
\[ n_t(m,k) = (1 + \alpha)m + p, \]
\[ n_s(m,k) = 2^{\lfloor k/m \rfloor} - 1, \quad \text{provided } 2^{\lfloor k/m \rfloor} - 1 \ge m. \]


TABLE 1 




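5. Bounds on $K_g(m,n) - I(m,n)$. Let $n = 2^\alpha m + 2^\alpha p + \theta$ with $0 \le p < m$,
$0 \le \theta < 2^\alpha$, $\alpha \ge 0$; and $k = K_g(m,n) = (2 + \alpha)m + p - 1$.
THEOREM 2. $K_g(m,n) - I(m,n) \le m - 1$.


Proof. We have
\[ I(m,n) = \left\lceil \log_2 \binom{m+n}{m} \right\rceil \]
and
\[ \binom{m+n}{m} = \frac{(m+n)!}{n!\,m!} \ge \frac{(2^\alpha m + 2^\alpha p + \theta + 1)^m}{m!}
> \frac{2^{\alpha m} m^m (1 + p/m)^m}{m!} \ge 2^{\alpha m + m - 1 + p}, \]
since
\[ m! \le m^m 2^{1-m} \quad \text{and} \quad (1 + p/m)^m \ge 2^p. \]
Hence
\[ \log_2 \binom{m+n}{m} > (\alpha + 1)m + p - 1, \]
and
\[ \left\lceil \log_2 \binom{m+n}{m} \right\rceil \ge (\alpha + 1)m + p. \]
Therefore,

\[ K_g(m,n) - I(m,n) \le m - 1. \]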




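COROLLARY 1. For $m > 1$ and $\theta = 2^\alpha - 1$,
\[ K_g(m,n) - I(m,n) \le m - 2. \]
Proof. For $m > 1$ and $\theta = 2^\alpha - 1$, we have

\[ \binom{m+n}{m} > \frac{(n + 1)^m}{m!} = \frac{(2^\alpha m + 2^\alpha (p + 1))^m}{m!}, \]

and the proof parallels the proof of Theorem 2.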
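
Both bounds can be confirmed numerically for moderate sizes. The sketch below is a check of our own, with hypothetical helper names; it computes $I(m,n)$ exactly with integers and $K_g(m,n)$ from Theorem 1, and then tests Theorem 2 and Corollary 1.

```python
from math import comb


def info_bound(m: int, n: int) -> int:
    """I(m, n) = ceil(log2(C(m + n, m)))."""
    return (comb(m + n, m) - 1).bit_length()


def k_g(m: int, n: int) -> int:
    """K_g(m, n) = (2 + alpha) * m + p - 1 (Theorem 1)."""
    alpha = (n // m).bit_length() - 1
    p = (n - (m << alpha)) >> alpha
    return (2 + alpha) * m + p - 1


def gap(m: int, n: int) -> int:
    return k_g(m, n) - info_bound(m, n)


for m in range(1, 25):
    for n in range(m, 400):
        assert 0 <= gap(m, n) <= m - 1                   # Theorem 2
        alpha = (n // m).bit_length() - 1
        theta = (n - (m << alpha)) & ((1 << alpha) - 1)
        if m > 1 and theta == (1 << alpha) - 1:
            assert gap(m, n) <= m - 2                    # Corollary 1
```

The lower limit 0 in the first assertion reflects the fact that no algorithm can beat the information theory bound.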

For larger m, a much sharper bound for K,(m,n) — I(m,n) can be derived 
by means of Stirling’s formula. First we prove a lemma. 

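LEMMA 1. Let $\xi = (1 + m/n)^{n/m}$ and $x_m$ be defined by

\[ (n + x_m)^m = (n + m)(n + m - 1) \cdots (n + 1). \]
Then

\[ x_m \ge \max\left\{1,\; \frac{\xi}{e}(n + m) - n\right\}. \]


Proof. It is clear that $x_m \ge 1$. From Stirling's formula, we have, with $0 < \theta_1, \theta_2 < 1$,

\begin{align*}
(n + x_m)^m &= (n + m)(n + m - 1) \cdots (n + 1) = \frac{(n + m)!}{n!} \\
&= \frac{\sqrt{2\pi}\,(n + m)^{n+m+1/2}\, e^{-(n+m) + \theta_1/(12(n+m))}}{\sqrt{2\pi}\, n^{n+1/2}\, e^{-n + \theta_2/(12n)}} \\
&> \sqrt{\frac{n + m}{n}} \left[1 + \frac{m}{n}\right]^n (n + m)^m e^{-m - 1/(12n)} \\
&> \xi^m (n + m)^m e^{-m},
\end{align*}

since

\[ \sqrt{\frac{m + n}{n}} = \left[\left(1 + \frac{m}{n}\right)^{n/m}\right]^{m/(2n)} \ge 2^{m/(2n)} \ge 2^{1/(2n)} = 4^{1/(4n)} > e^{1/(12n)}, \]

where we have used $\xi \ge 2$ for $m \le n$. Taking $m$th roots gives the lemma.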


Some typical values of x,,/m are given below in Table 2. 


TABLE 2 


> 0.4715 > 0.4831 > 0.4907 > 0.5065 > 0.5279 
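
The quantity $x_m$ and the lower bound of Lemma 1 can be evaluated numerically. The Python sketch below is our own illustration: it solves the defining equation through the log-gamma function, and the values of $m$ and of the ratio $n/m$ used in the loop are hypothetical choices, not the entries of Table 2.

```python
from math import e, exp, lgamma


def x_m(m: int, n: int) -> float:
    """Solve (n + x)^m = (n + m)(n + m - 1)...(n + 1) for x."""
    log_product = lgamma(n + m + 1) - lgamma(n + 1)    # log((n + m)! / n!)
    return exp(log_product / m) - n


def lemma_bound(m: int, n: int) -> float:
    """Lower bound of Lemma 1: max(1, (xi / e) * (n + m) - n)."""
    xi = (1 + m / n) ** (n / m)
    return max(1.0, xi / e * (n + m) - n)


m = 50
for ratio in (1, 2, 4, 8, 16):
    n = ratio * m
    print(ratio, x_m(m, n) / m, lemma_bound(m, n) / m)
```

When $n = m$ the bound equals $(4/e - 1)m$, so the ratio printed in the last column is independent of $m$ in that case.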
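THEOREM 3. Let $\varepsilon = (\theta + x_m)/2^\alpha$ and $t = \min(p + \varepsilon, m)$. Then

\[ I(m,n) = \left\lceil \log_2 \binom{n+m}{m} \right\rceil \ge (1 + \alpha)m + \lceil t + q_m \rceil, \]

where
\[ q_m = (\log_2 e - 1)m - \frac{\log_2 e}{12m} - \tfrac{1}{2} \log_2 (2\pi m)
\approx 0.442695m - \frac{0.12}{m} - \tfrac{1}{2} \log_2 (2\pi m). \]
Proof.
\begin{align*}
\binom{n+m}{m} &= \frac{(n + x_m)^m}{m!} \\
&> \frac{(n + x_m)^m}{\sqrt{2\pi m}\, m^m e^{-m + 1/(12m)}} \\
&= \frac{(2^\alpha m + 2^\alpha p + \theta + x_m)^m\, e^{m - 1/(12m)}}{\sqrt{2\pi m}\, m^m} \\
&= \frac{(2^\alpha m)^m (1 + (p + \varepsilon)/m)^m\, e^{m - 1/(12m)}}{\sqrt{2\pi m}\, m^m} \\
&\ge 2^{\alpha m + t + (\log_2 e)(m - 1/(12m)) - \log_2 \sqrt{2\pi m}}.
\end{align*}
Therefore

\[ I(m,n) = \left\lceil \log_2 \binom{m+n}{m} \right\rceil \ge (1 + \alpha)m + \lceil t + q_m \rceil, \]

which is to be proved.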
Since $K(m,n) \ge I(m,n)$, we have the following corollary.
COROLLARY 2.

\[ K_g(m,n) - K(m,n) \le K_g(m,n) - I(m,n) \le m + p - 1 - \lceil t + q_m \rceil. \]
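
The bound of Corollary 2 can be evaluated and compared with the actual gap $K_g(m,n) - I(m,n)$. The following Python sketch is our own check under stated assumptions: $x_m$ is computed in floating point through the log-gamma function, and a small slack of $10^{-9}$ is subtracted before rounding up so that rounding error can only weaken the bound. All function names are hypothetical.

```python
from math import ceil, comb, e, exp, lgamma, log2, pi


def corollary2_bound(m: int, n: int) -> int:
    """m + p - 1 - ceil(t + q_m), evaluated numerically."""
    alpha = (n // m).bit_length() - 1
    r = n - (m << alpha)
    p, theta = r >> alpha, r & ((1 << alpha) - 1)
    x = exp((lgamma(n + m + 1) - lgamma(n + 1)) / m) - n      # x_m of Lemma 1
    eps = (theta + x) / 2 ** alpha
    t = min(p + eps, m)
    q = (log2(e) - 1) * m - log2(e) / (12 * m) - 0.5 * log2(2 * pi * m)
    return m + p - 1 - ceil(t + q - 1e-9)                     # slack guards against rounding


def actual_gap(m: int, n: int) -> int:
    """K_g(m, n) - I(m, n), using Theorem 1 and exact integer arithmetic."""
    alpha = (n // m).bit_length() - 1
    p = (n - (m << alpha)) >> alpha
    k_g = (2 + alpha) * m + p - 1
    return k_g - (comb(m + n, m) - 1).bit_length()


for m in range(1, 25):
    for n in range(m, 300):
        assert actual_gap(m, n) <= corollary2_bound(m, n)
```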


Table 3 gives some values for

\[ q_m = (\log_2 e - 1)m - \frac{\log_2 e}{12m} - \tfrac{1}{2} \log_2 (2\pi m). \]
TABLE 3 


}=208 | 1.0008 —0.832 | —0.585 | —0.299 0.179 
Note that $t > p$ and $q_m > -1$ for $m \ge 3$ so that Corollary 2 implies Theorem 2
for $m \ge 3$. For $m \ge 6$, $q_m > 0$ and hence Corollary 2 also implies the conclusion
of Corollary 1 regardless of the value of $\theta$. For large $m$, say $m > 100$, we have
$K_g(m,n) - I(m,n) < 0.6m - 1$, and the “best” bound occurs when $\alpha = 0$, $p = 0.52m$,
$x_m \approx 0.48m$, $q_m \approx 0.44m$ and this gives $K_g(m,n) - I(m,n) = 0.08m$.
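
The thresholds quoted for $q_m$ can be reproduced directly from its definition. The short Python sketch below is ours (the function name `q` is hypothetical); it prints $q_m$ for small $m$ and checks the sign claims made above over a finite range.

```python
from math import e, log2, pi


def q(m: int) -> float:
    """q_m = (log2(e) - 1) * m - log2(e) / (12 * m) - 0.5 * log2(2 * pi * m)."""
    return (log2(e) - 1) * m - log2(e) / (12 * m) - 0.5 * log2(2 * pi * m)


for m in range(1, 7):
    print(m, round(q(m), 4))

assert all(q(m) > -1 for m in range(3, 200))   # q_m > -1 for m >= 3
assert all(q(m) > 0 for m in range(6, 200))    # q_m > 0 for m >= 6
```

Since the linear term dominates the logarithmic one, $q_m / m$ approaches $\log_2 e - 1 \approx 0.44$ as $m$ grows, which is the figure used in the large-$m$ estimate.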


6. Computational results. In this section, we discuss the storage requirements 
when the generalized binary algorithm g is implemented by a computer program 
and compare its running time with a similar program implementing the commonly 
used tape merge algorithm $t$. We assume that the sorted lists $A_m$ and $B_n$ to be
merged are stored on tapes (or other external devices) if they are too large to be
accommodated in core. These can then be read in sequentially in sorted order
as needed and the elements of the merged list $C_{m+n}$ written in similar sorted order
onto output devices as soon as they are sequentially determined. As can be seen
from the description of the algorithm $g$, for efficient comparison we need the
elements $a_m$ and those in $B_n$ from $b_x$ to $b_n$ in core. This requires a storage space
of $2^\alpha$ elements ($\alpha = \lfloor \log_2 (n/m) \rfloor$) which is approximately equal to $n/m$. In general,
this will not be excessive. For example, if n = 10’ and m = 10%, an average of 
10° elements of B are required to be in core and this ratio will be approximately 
maintained if the data in $B_n$ and $A_m$ are uniformly distributed in some interval.
If $n/m$ becomes too large, a slight modification of the algorithm can be made,
say, to compare $a_m$ with $b_x$, where $x = n - 2^\beta + 1$ for some smaller $\beta$, without
substantially affecting its efficiency.

Assuming the data in $A_m$ and $B_n$ are uniformly distributed in some interval,
the expected number of comparisons $E_t(m,n)$ required by the tape merge algorithm
$t$ can be seen to satisfy the following recurrence relation:

(R) $E_t(m,n) = 1 + \frac{m}{m+n} E_t(m - 1, n) + \frac{n}{m+n} E_t(m, n - 1); \quad E_t(1,1) = 1.$

Solving (R), we have

\[ E_t(m,n) = mn \left[ \frac{1}{m+1} + \frac{1}{n+1} \right] = m + n - \frac{m}{n+1} - \frac{n}{m+1}, \]

which is only slightly less than $K_t(m,n) = m + n - 1$.
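
The closed form can be checked against the recurrence (R) with exact rational arithmetic. In the Python sketch below, the function names are ours, and we assume the boundary values $E_t(0,n) = E_t(m,0) = 0$ (no comparisons are needed once one list is empty), which is consistent with $E_t(1,1) = 1$.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def expected_tape(m: int, n: int) -> Fraction:
    """Expected comparisons of the tape merge, from recurrence (R)."""
    if m == 0 or n == 0:
        return Fraction(0)                   # assumed boundary condition
    total = m + n
    # The overall maximum is a_m with probability m / (m + n), else b_n.
    return (1 + Fraction(m, total) * expected_tape(m - 1, n)
              + Fraction(n, total) * expected_tape(m, n - 1))


def closed_form(m: int, n: int) -> Fraction:
    """mn * (1 / (m + 1) + 1 / (n + 1))."""
    return Fraction(m * n, m + 1) + Fraction(m * n, n + 1)


for m in range(1, 15):
    for n in range(1, 15):
        assert expected_tape(m, n) == closed_form(m, n)
```

Using `Fraction` keeps every intermediate value exact, so the equality test is not affected by rounding.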

When $n/m$ is large, as in the case of updating telephone directories or library
materials, we see that $E_t(m,n) \approx m + n - n/m$ can be considerably larger than
$K_g(m,n) \approx (3 + \lfloor \log_2 (n/m) \rfloor)m$, the maximal number of comparisons required
using the algorithm $g$. Even when the logic involved in making one comparison
using the proposed algorithm $g$ is more involved than making one comparison
using the tape merge algorithm $t$, substantial savings in computer time can be
achieved. A computer program (FORTRAN, GE-635), implementing both the tape
merge algorithm $t$ and the generalized binary algorithm $g$ (hopefully with equal
degrees of efficiency), was written to test our assertions on some problems with
randomly generated data. The results are presented in Table 4. As can be seen,
the saving in time is great when $n/m$ is large.


TABLE 4


$C_t$ = number of comparisons made by the tape merge algorithm $t$;

$C_g$ = number of comparisons made by the generalized binary algorithm $g$;

$T_t$ = time (in milliseconds) spent in making the comparisons using $t$;

$T_g$ = time (in milliseconds) spent in making the comparisons using $g$.